library(rpart)
library(rpart.plot)
library(ISLR)
library(knitr)
library(caret)
library(MASS)
Denote by \(SS_T=\sum (y_i-\bar{y})^2\) the sum of squares within the node \(T\).
Regression trees use a splitting criterion different from classification trees. In the anova method the splitting criterion is \[SS_T - (SS_L + SS_R),\] where \(SS_L\) and \(SS_R\) are the sums of squares for the left and right sons of \(T\), respectively.
The anova method leads to regression trees; it is the default method if the target variable is a simple numeric vector, i.e., not a factor, matrix, or survival object.
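The criterion can be sketched in a few lines of base R. The helper below is hypothetical (not part of rpart) and illustrates the improvement achieved by a candidate split x < cutpoint:

```r
# Sum of squares of a numeric response around its mean
ss <- function(y) sum((y - mean(y))^2)

# Anova-style improvement of splitting y on the condition x < cutpoint
# (hypothetical helper for illustration, not part of rpart)
anovaImprove <- function(y, x, cutpoint) {
  left <- x < cutpoint
  ss(y) - (ss(y[left]) + ss(y[!left]))
}

# A perfect split removes all within-node variation:
y <- c(1, 1, 1, 5, 5, 5)
x <- c(1, 2, 3, 10, 11, 12)
anovaImprove(y, x, cutpoint = 5)  # 24, the whole SS_T
```

rpart evaluates this quantity for every candidate variable and cutpoint and picks the split with the largest improvement.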
Use the Hitters data from the ISLR library for a simple example.
data("Hitters")
# Remove incomplete cases
Hitters <- na.omit(Hitters)
dim(Hitters)
## [1] 263 20
kable(head(Hitters,3))
| | AtBat | Hits | HmRun | Runs | RBI | Walks | Years | CAtBat | CHits | CHmRun | CRuns | CRBI | CWalks | League | Division | PutOuts | Assists | Errors | Salary | NewLeague |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| -Alan Ashby | 315 | 81 | 7 | 24 | 38 | 39 | 14 | 3449 | 835 | 69 | 321 | 414 | 375 | N | W | 632 | 43 | 10 | 475 | N |
| -Alvin Davis | 479 | 130 | 18 | 66 | 72 | 76 | 3 | 1624 | 457 | 63 | 224 | 266 | 263 | A | W | 880 | 82 | 14 | 480 | A |
| -Andre Dawson | 496 | 141 | 20 | 65 | 78 | 37 | 11 | 5628 | 1575 | 225 | 828 | 838 | 354 | N | E | 200 | 11 | 3 | 500 | N |
Predict a baseball player’s Salary based on the number of Years spent in the major leagues and the number of Hits in the previous year.
Salary is measured in thousands of dollars.
Fit the model
salaryFit <- rpart(Salary~Hits+Years, data=Hitters)
The regression tree can be visualized by prp().
prp(salaryFit,extra=101, # display the number of observations that fall in the node
branch=.5, # change angle of branch lines
shadow.col="gray", # shadows under the leaves
branch.lty=3, # draw branches using dotted lines
split.cex=1.2, # make the split text larger than the node text
split.prefix="is ", # put "is " before split text
split.suffix="?", # put "?" after split text
split.box.col="lightgray", # lightgray split boxes (default is white)
split.border.col="darkgray", # darkgray border on split boxes
split.round=.5,
nn=TRUE) # display the node numbers, default is FALSE
Interpret the tree.
Which factors are most important for the size of compensation? Are there any questionable results of the tree model?
# The group with under 4.5 years and under 42 hits has a higher predicted salary
# than the group with under 3.5 years and any number of hits,
# and also higher than other groups with under 4.5 years -- a questionable result.
A tree with terminal nodes that are too small is unreliable and misleading.
This tree is too deep.
Analyze the depth of the tree. How many nodes are necessary to keep the tree stable and still accurate?
The decision about the appropriate depth of the tree is based on the parameter CP and the error columns.
A rule of thumb is that the tree should be pruned at the minimum level where rel error + xstd < xerror.
Check the output cptable.
salaryFit$cptable
## CP nsplit rel error xerror xstd
## 1 0.24674996 0 1.0000000 1.0177861 0.1392688
## 2 0.18990577 1 0.7532500 0.7723039 0.1283051
## 3 0.02052199 2 0.5633443 0.5963625 0.1040882
## 4 0.01428090 3 0.5428223 0.6416193 0.1098044
## 5 0.01162544 4 0.5285414 0.6554035 0.1108470
## 6 0.01087042 5 0.5169159 0.6633079 0.1109284
## 7 0.01026659 8 0.4843047 0.6566277 0.1110065
## 8 0.01000000 10 0.4637715 0.6574350 0.1111501
cbind(salaryFit$cptable[,3]+salaryFit$cptable[,5],salaryFit$cptable[,4])
## [,1] [,2]
## 1 1.1392688 1.0177861
## 2 0.8815552 0.7723039
## 3 0.6674325 0.5963625
## 4 0.6526267 0.6416193
## 5 0.6393884 0.6554035
## 6 0.6278444 0.6633079
## 7 0.5953112 0.6566277
## 8 0.5749216 0.6574350
The rule of thumb shows that the tree should be pruned at 3 splits, i.e., at CP = 0.020522.
Show this tree.
prunedTree <- prune(salaryFit, cp = salaryFit$cptable[3,1])
printcp(prunedTree)
##
## Regression tree:
## rpart(formula = Salary ~ Hits + Years, data = Hitters)
##
## Variables actually used in tree construction:
## [1] Hits Years
##
## Root node error: 53319113/263 = 202734
##
## n= 263
##
## CP nsplit rel error xerror xstd
## 1 0.246750 0 1.00000 1.01779 0.13927
## 2 0.189906 1 0.75325 0.77230 0.12831
## 3 0.020522 2 0.56334 0.59636 0.10409
The output of printcp() contains important information about the number of variables participating in splits.
Extract those variables.
The tree structure is stored in a component of the rpart object called frame. Each row of that data frame contains information about one node. Column var contains the name of the variable involved in the split; terminal nodes are denoted as <leaf>.
tree_frame<-prunedTree$frame
tree_leaves_idx<-tree_frame$var=="<leaf>"
(used_variables<-unique(tree_frame$var[!tree_leaves_idx]))
## [1] "Years" "Hits"
It is also possible to extract the importance of variables, measured by “goodness of split” (how much the loss decreased each time the variable participated in a split as a primary or surrogate variable).
prunedTree$variable.importance
## Years Hits
## 13644470 10564157
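These raw values are easier to read as shares of the total. A small base-R sketch using the numbers printed above:

```r
# Variable importance values from the output above, rescaled to percentages
imp <- c(Years = 13644470, Hits = 10564157)
round(100 * imp / sum(imp), 1)  # Years ~56.4%, Hits ~43.6%
```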
Plot the pruned tree.
prp(prunedTree,extra=101, # display the number of observations that fall in the node
branch=.5, # change angle of branch lines
shadow.col="gray", # shadows under the leaves
branch.lty=3, # draw branches using dotted lines
split.cex=1.2, # make the split text larger than the node text
split.prefix="is ", # put "is " before split text
split.suffix="?", # put "?" after split text
split.box.col="lightgray", # lightgray split boxes (default is white)
split.border.col="darkgray", # darkgray border on split boxes
split.round=.5,
nn=TRUE) # display the node numbers, default is FALSE
This tree is smaller: it separates players by experience, then further splits the more experienced group by the number of hits.
Another way of pruning the tree is to find the best CP level by xerror alone: find the level where it bottoms out.
printcp(salaryFit)
##
## Regression tree:
## rpart(formula = Salary ~ Hits + Years, data = Hitters)
##
## Variables actually used in tree construction:
## [1] Hits Years
##
## Root node error: 53319113/263 = 202734
##
## n= 263
##
## CP nsplit rel error xerror xstd
## 1 0.246750 0 1.00000 1.01779 0.13927
## 2 0.189906 1 0.75325 0.77230 0.12831
## 3 0.020522 2 0.56334 0.59636 0.10409
## 4 0.014281 3 0.54282 0.64162 0.10980
## 5 0.011625 4 0.52854 0.65540 0.11085
## 6 0.010870 5 0.51692 0.66331 0.11093
## 7 0.010267 8 0.48430 0.65663 0.11101
## 8 0.010000 10 0.46377 0.65744 0.11115
plotcp(salaryFit)
(best.CP = salaryFit$cptable[which.min(salaryFit$cptable[,"xerror"]),"CP"])
## [1] 0.02052199
prunedTree <- prune(salaryFit, cp = best.CP)
printcp(prunedTree)
##
## Regression tree:
## rpart(formula = Salary ~ Hits + Years, data = Hitters)
##
## Variables actually used in tree construction:
## [1] Hits Years
##
## Root node error: 53319113/263 = 202734
##
## n= 263
##
## CP nsplit rel error xerror xstd
## 1 0.246750 0 1.00000 1.01779 0.13927
## 2 0.189906 1 0.75325 0.77230 0.12831
## 3 0.020522 2 0.56334 0.59636 0.10409
prp(prunedTree,extra=101, # display the number of observations that fall in the node
branch=.5, # change angle of branch lines
shadow.col="gray", # shadows under the leaves
branch.lty=3, # draw branches using dotted lines
split.cex=1.2, # make the split text larger than the node text
split.prefix="is ", # put "is " before split text
split.suffix="?", # put "?" after split text
split.box.col="lightgray", # lightgray split boxes (default is white)
split.border.col="darkgray", # darkgray border on split boxes
split.round=.5,
nn=TRUE) # display the node numbers, default is FALSE
After the tree is pruned, forecasting is done by assigning to each observation the mean response (salary) of the leaf it falls into.
This tree seems oversimplified.
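That the prediction is simply the leaf mean can be checked directly. A minimal sketch with a toy dataset (only the rpart package assumed):

```r
library(rpart)

# Toy data with four clearly separated groups
set.seed(1)
toy <- data.frame(x = rep(1:4, each = 25))
toy$y <- toy$x * 10 + rnorm(100)

fit <- rpart(y ~ x, data = toy)

# fit$where maps each observation to its leaf (a row of fit$frame);
# predict() returns exactly the mean response of that leaf
leaf_means <- tapply(toy$y, fit$where, mean)
all.equal(as.numeric(leaf_means[as.character(fit$where)]),
          as.numeric(predict(fit)))  # TRUE
```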
Look at the residuals from this model, just as with a regular linear regression fit.
plot(predict(prunedTree), jitter(resid(prunedTree)))
temp <- prunedTree$frame[prunedTree$frame$var == '<leaf>',]
axis(3, at = temp$yval, as.character(row.names(temp)))
mtext('leaf number', side = 3, line = 3)
abline(h = 0, lty = 2)
Note that node 6 has many players and a fairly good concentration of residuals, but leaf 2 has outliers.
Find the sum of squared residuals of the prediction (referred to as MSE below).
(prunedTree.MSE<-sum((Hitters$Salary-predict(prunedTree))^2))
## [1] 30037016
sum(resid(prunedTree)^2)
## [1] 30037016
Try to grow a tree without outliers.
To do this control size of buckets.
salaryFitControlled <- rpart(Salary~Hits+Years, data=Hitters,minbucket=15)
prp(salaryFitControlled,extra=101, # display the number of observations that fall in the node
branch=.5, # change angle of branch lines
shadow.col="gray", # shadows under the leaves
branch.lty=3, # draw branches using dotted lines
split.cex=1.2, # make the split text larger than the node text
split.prefix="is ", # put "is " before split text
split.suffix="?", # put "?" after split text
split.box.col="lightgray", # lightgray split boxes (default is white)
split.border.col="darkgray", # darkgray border on split boxes
split.round=.5,
nn=TRUE) # display the node numbers, default is FALSE
(salaryFitControlled.MSE<-sum((Hitters$Salary-predict(salaryFitControlled))^2))
## [1] 27915376
sum(resid(salaryFitControlled)^2)
## [1] 27915376
This tree has a lower MSE than the pruned tree and a more meaningful distribution of nodes.
Answer the following questions using the size-controlled decision tree model:
(Skipped Code)
Show that the answer to question 3 is:
#1. Node #13: $518k
#2.
# * Increase hits above 118
# * Increase hits beyond 142 and wait for a year
#3. P(H<142|Y>=4.5,H>=118)=P(118<=H<142,Y>=4.5)/P(Y>=4.5,H>=118)=0.12/0.32
0.12/0.32
## [1] 0.375
Use library(caret) to build a tree.
set.seed(0)
ctrl <- trainControl(method = "cv", number = 10)
tree.slice <- train(Salary~Hits+Years, data=Hitters,
method = 'rpart', trControl = ctrl)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
tree.slice$results
## cp RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 0.02052199 350.4053 0.4366677 239.6130 99.24844 0.1845119 50.64152
## 2 0.18990577 369.2794 0.3555307 259.0561 101.99208 0.1557585 53.84690
## 3 0.24674996 441.4993 0.1217775 333.6233 93.62968 0.1564601 51.12298
prp(tree.slice$finalModel,extra=101, # display the number of observations that fall in the node
branch=.5, # change angle of branch lines
shadow.col="gray", # shadows under the leaves
branch.lty=3, # draw branches using dotted lines
split.cex=1.2, # make the split text larger than the node text
split.prefix="is ", # put "is " before split text
split.suffix="?", # put "?" after split text
split.box.col="lightgray", # lightgray split boxes (default is white)
split.border.col="darkgray", # darkgray border on split boxes
split.round=.5,
nn=TRUE) # display the node numbers, default is FALSE
Calculate MSE.
(caretTree.MSE<-sum(residuals(tree.slice$finalModel)^2))
## [1] 30037016
Compare the MSE values from 2 packages:
c(rpart=prunedTree.MSE,caret.rpart=caretTree.MSE)
## rpart caret.rpart
## 30037016 30037016
matplot(1:length(resid(prunedTree)),
cbind(resid(prunedTree),
residuals(tree.slice$finalModel),
resid(salaryFitControlled)),
pch=c(15,16,16),col=c("black","red","orange"),
ylab="Residuals",xlab="Index")
legend("topright",legend=c("rpartBestCP","caret","controlledSize"),pch=c(15,16,16),col=c("black","red","orange"))
The following examples are from the long introduction to rpart.
The dataset car90 contains a collection of variables from the April, 1990 Consumer Reports; it has 34 variables on 111 cars.
Variables tire size and model name are excluded because they are factors with a very large number of levels, which creates a very long printout; rim size is excluded because it is too good a predictor of price and leads to a less interesting illustration. (Tiny cars are cheaper and have small rims.)
cars <- car90[, -match(c("Rim", "Tires", "Model2"), names(car90))]
head(cars)
## Country Disp Disp2 Eng.Rev Front.Hd Frt.Leg.Room Frt.Shld
## Acura Integra Japan 112 1.8 2935 3.5 41.5 53.0
## Acura Legend Japan 163 2.7 2505 2.0 41.5 55.5
## Audi 100 Germany 141 2.3 2775 2.5 41.5 56.5
## Audi 80 Germany 121 2.0 2835 4.0 42.0 52.5
## BMW 325i Germany 152 2.5 2625 2.0 42.0 52.0
## BMW 535i Germany 209 3.5 2285 3.0 42.0 54.5
## Gear.Ratio Gear2 HP HP.revs Height Length Luggage Mileage Price
## Acura Integra 3.26 3.21 130 6000 47.5 177 16 NA 11950
## Acura Legend 2.95 3.02 160 5900 50.0 191 14 20 24760
## Audi 100 3.27 3.25 130 5500 51.5 193 17 NA 26900
## Audi 80 3.25 3.25 108 5300 50.5 176 10 27 18900
## BMW 325i 3.02 2.99 168 5800 49.5 175 12 NA 24650
## BMW 535i 2.80 2.85 208 5700 51.0 186 12 NA 33200
## Rear.Hd Rear.Seating RearShld Reliability Sratio.m Sratio.p
## Acura Integra 1.5 26.5 52.0 Much better NA 0.86
## Acura Legend 2.0 28.5 55.5 Much better NA 0.96
## Audi 100 3.0 31.0 55.0 <NA> NA 0.97
## Audi 80 1.0 28.0 52.0 <NA> NA 0.71
## BMW 325i 1.0 25.5 51.5 better NA 0.88
## BMW 535i 2.5 27.0 55.5 <NA> NA 0.78
## Steering Tank Trans1 Trans2 Turning Type Weight Wheel.base
## Acura Integra power 13.2 man.5 auto.4 37 Small 2700 102
## Acura Legend power 18.0 man.5 auto.4 42 Medium 3265 109
## Audi 100 power 21.1 man.5 auto.3 39 Medium 2935 106
## Audi 80 power 15.9 man.5 auto.3 35 Compact 2670 100
## BMW 325i power 16.4 man.5 auto.4 35 Compact 2895 101
## BMW 535i power 21.1 man.5 auto.4 39 Medium 3640 109
## Width
## Acura Integra 67
## Acura Legend 69
## Audi 100 71
## Audi 80 67
## BMW 325i 65
## BMW 535i 69
carfit <- rpart(Price/1000 ~ ., data=cars)
carfit
## n=105 (6 observations deleted due to missingness)
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 105 7118.26700 15.805220
## 2) Disp< 156 70 1491.62800 11.855870
## 4) Country=Brazil,Japan,Japan/USA,Korea,Mexico,USA 58 421.21470 10.318470
## 8) Type=Small 21 50.30983 7.629048 *
## 9) Type=Compact,Medium,Sporty,Van 37 132.80330 11.844890 *
## 5) Country=France,Germany,Sweden 12 270.72330 19.286670 *
## 3) Disp>=156 35 2351.19600 23.703910
## 6) HP.revs< 5550 24 980.27790 20.388710
## 12) Disp< 267.5 16 395.99670 17.820060 *
## 13) Disp>=267.5 8 267.58000 25.526000 *
## 7) HP.revs>=5550 11 531.63680 30.937090 *
prp(carfit,extra=101, # display the number of observations that fall in the node
branch=.5, # change angle of branch lines
shadow.col="gray", # shadows under the leaves
branch.lty=3, # draw branches using dotted lines
split.cex=1.2, # make the split text larger than the node text
split.prefix="is ", # put "is " before split text
split.suffix="?", # put "?" after split text
split.box.col="lightgray", # lightgray split boxes (default is white)
split.border.col="darkgray", # darkgray border on split boxes
split.round=.5,
nn=TRUE) # display the node numbers, default is FALSE
printcp(carfit)
##
## Regression tree:
## rpart(formula = Price/1000 ~ ., data = cars)
##
## Variables actually used in tree construction:
## [1] Country Disp HP.revs Type
##
## Root node error: 7118.3/105 = 67.793
##
## n=105 (6 observations deleted due to missingness)
##
## CP nsplit rel error xerror xstd
## 1 0.460146 0 1.00000 1.00373 0.16091
## 2 0.117905 1 0.53985 0.73010 0.11672
## 3 0.112343 2 0.42195 0.72191 0.11616
## 4 0.044491 3 0.30961 0.65157 0.12016
## 5 0.033449 4 0.26511 0.62365 0.12598
## 6 0.010000 5 0.23166 0.61966 0.13556
summary(carfit, cp = 0.1)
## Call:
## rpart(formula = Price/1000 ~ ., data = cars)
## n=105 (6 observations deleted due to missingness)
##
## CP nsplit rel error xerror xstd
## 1 0.46014608 0 1.0000000 1.0037274 0.1609054
## 2 0.11790530 1 0.5398539 0.7301001 0.1167165
## 3 0.11234341 2 0.4219486 0.7219094 0.1161613
## 4 0.04449133 3 0.3096052 0.6515713 0.1201572
## 5 0.03344936 4 0.2651139 0.6236470 0.1259838
## 6 0.01000000 5 0.2316645 0.6196641 0.1355606
##
## Variable importance
## Disp Disp2 Weight Tank HP Wheel.base Country
## 19 18 13 12 11 9 6
## HP.revs Gear2 Front.Hd Type Length Steering Width
## 4 3 1 1 1 1 1
##
## Node number 1: 105 observations, complexity param=0.4601461
## mean=15.80522, MSE=67.79302
## left son=2 (70 obs) right son=3 (35 obs)
## Primary splits:
## Disp < 156 to the left, improve=0.4601461, (0 missing)
## Disp2 < 2.55 to the left, improve=0.4601461, (0 missing)
## HP < 154 to the left, improve=0.4548845, (0 missing)
## Tank < 17.8 to the left, improve=0.4431325, (0 missing)
## Weight < 2890 to the left, improve=0.3912428, (0 missing)
## Surrogate splits:
## Disp2 < 2.55 to the left, agree=1.000, adj=1.000, (0 split)
## Weight < 3095 to the left, agree=0.914, adj=0.743, (0 split)
## HP < 139 to the left, agree=0.895, adj=0.686, (0 split)
## Tank < 17.95 to the left, agree=0.895, adj=0.686, (0 split)
## Wheel.base < 105.5 to the left, agree=0.857, adj=0.571, (0 split)
##
## Node number 2: 70 observations, complexity param=0.1123434
## mean=11.85587, MSE=21.30898
## left son=4 (58 obs) right son=5 (12 obs)
## Primary splits:
## Country splits as L-RRLLLLRL, improve=0.5361191, (0 missing)
## Tank < 15.65 to the left, improve=0.3805426, (0 missing)
## Weight < 2567.5 to the left, improve=0.3691043, (0 missing)
## Type splits as R-RLRR, improve=0.3649737, (0 missing)
## HP < 105.5 to the left, improve=0.3577873, (0 missing)
## Surrogate splits:
## Tank < 17.8 to the left, agree=0.857, adj=0.167, (0 split)
## Rear.Seating < 28.75 to the left, agree=0.843, adj=0.083, (0 split)
##
## Node number 3: 35 observations, complexity param=0.1179053
## mean=23.70391, MSE=67.17703
## left son=6 (24 obs) right son=7 (11 obs)
## Primary splits:
## HP.revs < 5550 to the left, improve=0.3569594, (0 missing)
## HP < 173.5 to the left, improve=0.3146835, (0 missing)
## Frt.Shld < 57.75 to the right, improve=0.2554028, (0 missing)
## Front.Hd < 3.25 to the right, improve=0.2278477, (0 missing)
## Type splits as RLR-RL, improve=0.2178764, (0 missing)
## Surrogate splits:
## Country splits as -R-RR----L, agree=0.886, adj=0.636, (0 split)
## Gear2 < 2.735 to the left, agree=0.886, adj=0.636, (0 split)
## Disp < 172.5 to the right, agree=0.800, adj=0.364, (0 split)
## Disp2 < 2.85 to the right, agree=0.800, adj=0.364, (0 split)
## Front.Hd < 3.25 to the right, agree=0.800, adj=0.364, (0 split)
##
## Node number 4: 58 observations
## mean=10.31847, MSE=7.262322
##
## Node number 5: 12 observations
## mean=19.28667, MSE=22.56027
##
## Node number 6: 24 observations
## mean=20.38871, MSE=40.84491
##
## Node number 7: 11 observations
## mean=30.93709, MSE=48.33062
Surrogate splits may be used when certain features of the input vector are missing, so that the prediction procedure would otherwise get stuck at a node. In addition to the best “primary” split, every tree node may also be split on one or more other variables with nearly the same result.
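The effect can be demonstrated on the car90 tree above: Disp2 is a perfect surrogate for the primary split on Disp (agree=1.000 in the summary), so hiding Disp from an observation should not change its prediction. A small experiment, assuming the rpart package and the data preparation shown above:

```r
library(rpart)

# Rebuild the price tree as above
cars <- car90[, -match(c("Rim", "Tires", "Model2"), names(car90))]
carfit <- rpart(Price/1000 ~ ., data = cars)

# Hide the primary split variable for one car
obs <- cars["Acura Legend", ]
obs$Disp <- NA

# The surrogate split on Disp2 routes the observation the same way
predict(carfit, newdata = obs) ==
  predict(carfit, newdata = cars["Acura Legend", ])  # TRUE
```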
The following plot shows the residuals (observed minus predicted cost of cars) versus the predicted cost, labeled by the leaves in which the cars landed.
There appears to be more variability in node 7 than in some of the other leaves.
plot(predict(carfit), jitter(resid(carfit)))
temp <- carfit$frame[carfit$frame$var == '<leaf>',]
axis(3, at = temp$yval, as.character(row.names(temp)))
mtext('leaf number', side = 3, line = 3)
abline(h = 0, lty = 2)
The following graphs show how \(R^2\) improves and how the X Relative Error drops with a growing number of splits.
par(mfrow=c(1,2))
rsq.rpart(carfit)
##
## Regression tree:
## rpart(formula = Price/1000 ~ ., data = cars)
##
## Variables actually used in tree construction:
## [1] Country Disp HP.revs Type
##
## Root node error: 7118.3/105 = 67.793
##
## n=105 (6 observations deleted due to missingness)
##
## CP nsplit rel error xerror xstd
## 1 0.460146 0 1.00000 1.00373 0.16091
## 2 0.117905 1 0.53985 0.73010 0.11672
## 3 0.112343 2 0.42195 0.72191 0.11616
## 4 0.044491 3 0.30961 0.65157 0.12016
## 5 0.033449 4 0.26511 0.62365 0.12598
## 6 0.010000 5 0.23166 0.61966 0.13556
par(mfrow=c(1,1))
The Poisson splitting method attempts to extend rpart models to event rate data.
The model in this case is \(\lambda=f(x)\), where \(\lambda\) is an event rate and \(x\) is some set of predictors.
The solder data frame contains 900 observations, the results of an experiment varying 5 factors relevant to the wave-soldering procedure for mounting components on printed circuit boards.
The response variable, skips, is a count of how many solder skips appeared during visual inspection.
The other variables are the five experimental factors: Opening, Solder, Mask, PadType, and Panel.
In this call, the rpart.control options are modified: maxcompete = 2 means that only 2 other competing splits are listed (default is 4); cp = 0.004 means that a larger tree will be built initially (default is 0.01).
sfit <- rpart(skips ~ Opening + Solder + Mask + PadType + Panel, data = solder, method = 'poisson',
control = rpart.control(cp = 0.004, maxcompete = 2))
printcp(sfit)
##
## Rates regression tree:
## rpart(formula = skips ~ Opening + Solder + Mask + PadType + Panel,
## data = solder, method = "poisson", control = rpart.control(cp = 0.004,
## maxcompete = 2))
##
## Variables actually used in tree construction:
## [1] Mask Opening PadType Panel Solder
##
## Root node error: 8788.2/900 = 9.7647
##
## n= 900
##
## CP nsplit rel error xerror xstd
## 1 0.3037885 0 1.00000 1.00270 0.0521095
## 2 0.1540759 1 0.69621 0.69973 0.0326983
## 3 0.1284762 2 0.54214 0.54514 0.0252752
## 4 0.0428654 3 0.41366 0.41711 0.0194107
## 5 0.0287271 4 0.37079 0.37616 0.0173988
## 6 0.0272254 5 0.34207 0.36314 0.0171739
## 7 0.0183899 6 0.31484 0.32327 0.0151030
## 8 0.0156026 7 0.29645 0.31862 0.0148515
## 9 0.0125093 8 0.28085 0.29398 0.0136167
## 10 0.0089906 10 0.25583 0.27232 0.0124124
## 11 0.0085255 11 0.24684 0.26565 0.0117998
## 12 0.0066697 13 0.22979 0.25594 0.0114715
## 13 0.0063814 14 0.22312 0.25094 0.0113376
## 14 0.0060658 15 0.21674 0.24973 0.0112412
## 15 0.0059467 16 0.21067 0.24438 0.0109590
## 16 0.0058737 17 0.20473 0.24440 0.0109517
## 17 0.0055449 18 0.19885 0.23646 0.0107538
## 18 0.0050760 19 0.19331 0.23148 0.0105358
## 19 0.0045152 20 0.18823 0.21829 0.0099479
## 20 0.0044702 21 0.18372 0.21508 0.0098090
## 21 0.0044302 22 0.17925 0.21459 0.0099189
## 22 0.0040000 23 0.17482 0.20869 0.0096200
prp(sfit,extra=101, # display the number of observations that fall in the node
branch=.5, # change angle of branch lines
shadow.col="gray", # shadows under the leaves
branch.lty=3, # draw branches using dotted lines
split.cex=1.2, # make the split text larger than the node text
split.prefix="is ", # put "is " before split text
split.suffix="?", # put "?" after split text
split.box.col="lightgray", # lightgray split boxes (default is white)
split.border.col="darkgray", # darkgray border on split boxes
split.round=.5,
nn=TRUE) # display the node numbers, default is FALSE
summary(sfit, cp = 0.1)
## Call:
## rpart(formula = skips ~ Opening + Solder + Mask + PadType + Panel,
## data = solder, method = "poisson", control = rpart.control(cp = 0.004,
## maxcompete = 2))
## n= 900
##
## CP nsplit rel error xerror xstd
## 1 0.303788504 0 1.0000000 1.0027009 0.052109520
## 2 0.154075869 1 0.6962115 0.6997296 0.032698324
## 3 0.128476176 2 0.5421356 0.5451440 0.025275240
## 4 0.042865392 3 0.4136595 0.4171140 0.019410687
## 5 0.028727072 4 0.3707941 0.3761589 0.017398837
## 6 0.027225437 5 0.3420670 0.3631444 0.017173886
## 7 0.018389899 6 0.3148416 0.3232708 0.015103045
## 8 0.015602563 7 0.2964517 0.3186170 0.014851451
## 9 0.012509312 8 0.2808491 0.2939828 0.013616678
## 10 0.008990627 10 0.2558305 0.2723185 0.012412370
## 11 0.008525450 11 0.2468398 0.2656460 0.011799802
## 12 0.006669726 13 0.2297889 0.2559355 0.011471486
## 13 0.006381358 14 0.2231192 0.2509390 0.011337577
## 14 0.006065836 15 0.2167379 0.2497272 0.011241153
## 15 0.005946661 16 0.2106720 0.2443843 0.010959021
## 16 0.005873677 17 0.2047254 0.2444042 0.010951675
## 17 0.005544938 18 0.1988517 0.2364607 0.010753779
## 18 0.005075974 19 0.1933067 0.2314824 0.010535772
## 19 0.004515244 20 0.1882308 0.2182898 0.009947870
## 20 0.004470166 21 0.1837155 0.2150782 0.009808968
## 21 0.004430245 22 0.1792454 0.2145904 0.009918866
## 22 0.004000000 23 0.1748151 0.2086901 0.009619960
##
## Variable importance
## Mask Opening Solder PadType Panel
## 38 36 16 9 1
##
## Node number 1: 900 observations, complexity param=0.3037885
## events=4977, estimated rate=5.53 , mean deviance=9.764721
## left son=2 (600 obs) right son=3 (300 obs)
## Primary splits:
## Opening splits as LLR, improve=2669.769, (0 missing)
## Mask splits as LLRLR, improve=2162.010, (0 missing)
## Solder splits as LR, improve=1168.401, (0 missing)
##
## Node number 2: 600 observations, complexity param=0.1284762
## events=1531, estimated rate=2.552564 , mean deviance=4.927685
## left son=4 (420 obs) right son=5 (180 obs)
## Primary splits:
## Mask splits as LLRLR, improve=1129.0820, (0 missing)
## Opening splits as LR-, improve= 250.7655, (0 missing)
## Solder splits as LR, improve= 219.7641, (0 missing)
##
## Node number 3: 300 observations, complexity param=0.1540759
## events=3446, estimated rate=11.48308 , mean deviance=10.53956
## left son=6 (150 obs) right son=7 (150 obs)
## Primary splits:
## Mask splits as LLRRR, improve=1354.0590, (0 missing)
## Solder splits as LR, improve= 976.9158, (0 missing)
## PadType splits as RRRRLLRLRL, improve= 313.2002, (0 missing)
## Surrogate splits:
## Solder splits as LR, agree=0.6, adj=0.2, (0 split)
##
## Node number 4: 420 observations
## events=433, estimated rate=1.032889 , mean deviance=2.081981
##
## Node number 5: 180 observations
## events=1098, estimated rate=6.099428 , mean deviance=5.294992
##
## Node number 6: 150 observations
## events=680, estimated rate=4.534533 , mean deviance=4.350953
##
## Node number 7: 150 observations
## events=2766, estimated rate=18.42446 , mean deviance=7.701124
Note the splitting criterion. The improvement is \(\text{Deviance}_{parent}-(\text{Deviance}_L+\text{Deviance}_R)\), which is the likelihood ratio test statistic for comparing two Poisson samples.
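This criterion can be sketched directly. The helpers below are hypothetical illustrations (assuming equal exposure for all observations), not rpart internals:

```r
# Poisson deviance of counts y around their common mean rate
# (the y*log(y/mu) term is taken as 0 when y = 0)
poissonDev <- function(y) {
  mu <- mean(y)
  2 * sum(ifelse(y == 0, 0, y * log(y / mu)) - (y - mu))
}

# Improvement of a split: parent deviance minus children's deviances
devImprove <- function(yL, yR) {
  poissonDev(c(yL, yR)) - (poissonDev(yL) + poissonDev(yR))
}

# Separating a low-rate group from a high-rate group improves the deviance
devImprove(c(0, 1, 0, 2), c(8, 9, 10, 12))  # positive
```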
fit.prune <- prune(sfit, cp = 0.1)
printcp(fit.prune)
##
## Rates regression tree:
## rpart(formula = skips ~ Opening + Solder + Mask + PadType + Panel,
## data = solder, method = "poisson", control = rpart.control(cp = 0.004,
## maxcompete = 2))
##
## Variables actually used in tree construction:
## [1] Mask Opening
##
## Root node error: 8788.2/900 = 9.7647
##
## n= 900
##
## CP nsplit rel error xerror xstd
## 1 0.30379 0 1.00000 1.00270 0.052110
## 2 0.15408 1 0.69621 0.69973 0.032698
## 3 0.12848 2 0.54214 0.54514 0.025275
## 4 0.10000 3 0.41366 0.41711 0.019411
prp(fit.prune,extra=101, # display the number of observations that fall in the node
branch=.5, # change angle of branch lines
shadow.col="gray", # shadows under the leaves
branch.lty=3, # draw branches using dotted lines
split.cex=1.2, # make the split text larger than the node text
split.prefix="is ", # put "is " before split text
split.suffix="?", # put "?" after split text
split.box.col="lightgray", # lightgray split boxes (default is white)
split.border.col="darkgray", # darkgray border on split boxes
split.round=.5,
nn=T) # display the node numbers, default is FALSE
The function prune() trims the fitted tree to cp = 0.10.
The same tree could have been created by specifying cp = 0.10 in the original call to rpart().
Prepare a function for calculating the root mean squared error (RMSE) of the model.
Let x be the vector of residuals of the model.
Then the RMSE is calculated as:
rmse <- function(x) sqrt(mean(x^2))
Use rmse() to calculate measure of fit of the pruned tree.
treeRmse<-rmse(resid(fit.prune))
Calculate the probability of a zero count under the Poisson tree model for row 6 of the data frame solder.
Function predict() returns the estimated intensity of the Poisson process.
(preTree<-predict(fit.prune,newdata=solder[6,]))
## 6
## 1.032889
(pred0Tree<-dpois(0,lambda=preTree))
## [1] 0.3559772
In this section we explore how the rules of the regression tree apply to each row of observations.
library(plyr)
library(data.tree)
library(partykit)
Prepare a function creating a matrix of all paths from the root to the leaves of a given tree.
rpartPaths<-function(x){
require(plyr)
require(partykit)
require(rpart)
require(data.tree)
dtTree<-as.Node(as.party(x)) # transform object into data.tree
myPaths<-lapply(dtTree$leaves, # find all paths from root to each leaf of the tree
function(z) t(as.matrix(as.numeric(FindNode(dtTree,z$name)$path))))
myPaths<-rbind.fill.matrix(myPaths) # create matrix with variable row lengths
myPaths
}
Analyze the unpruned tree, which has the following configuration:
plot(as.Node(as.party(sfit)))
Or:
plot(as.Node(as.party(sfit)),output = 'visNetwork')
Create matrix of paths.
(pts<-rpartPaths(sfit))
## 1 2 3 4 5 6 7
## [1,] 1 2 3 4 5 6 NA
## [2,] 1 2 3 4 5 7 NA
## [3,] 1 2 3 4 8 NA NA
## [4,] 1 2 3 9 10 NA NA
## [5,] 1 2 3 9 11 12 NA
## [6,] 1 2 3 9 11 13 14
## [7,] 1 2 3 9 11 13 15
## [8,] 1 2 16 17 18 19 NA
## [9,] 1 2 16 17 18 20 NA
## [10,] 1 2 16 17 21 NA NA
## [11,] 1 2 16 22 23 24 NA
## [12,] 1 2 16 22 23 25 NA
## [13,] 1 2 16 22 26 NA NA
## [14,] 1 27 28 29 30 NA NA
## [15,] 1 27 28 29 31 NA NA
## [16,] 1 27 28 32 33 NA NA
## [17,] 1 27 28 32 34 NA NA
## [18,] 1 27 35 36 37 NA NA
## [19,] 1 27 35 36 38 39 NA
## [20,] 1 27 35 36 38 40 NA
## [21,] 1 27 35 41 42 43 NA
## [22,] 1 27 35 41 42 44 NA
## [23,] 1 27 35 41 45 46 NA
## [24,] 1 27 35 41 45 47 NA
Create vector of all terminal nodes.
(lvs<-apply(pts,1,function(z) tail(na.omit(z),1)))
## [1] 6 7 8 10 12 14 15 19 20 21 24 25 26 30 31 33 34 37 39 40 43 44 46 47
Create table of numbers of observations in terminal nodes
(distrInLeaves<-table(sfit$where))
##
## 6 7 8 10 12 14 15 19 20 21 24 25 26 30 31 33 34 37 39 40 43 44 46 47
## 90 60 60 70 98 12 30 30 30 30 30 15 45 36 54 42 18 24 18 18 9 36 30 15
Create vector of predictions by unpruned tree.
Pred<-round(predict(sfit,newdata=solder),2)
Finally, collect in one matrix the data, predictions, terminal nodes, and entire paths for each observation.
completeTreeMatrix<-cbind(solder,Pred,Where=sfit$where,pts[match(sfit$where,lvs),])
head(completeTreeMatrix)
## Opening Solder Mask PadType Panel skips Pred Where 1 2 3 4 5 6 7
## 1 L Thick A1.5 W4 1 0 0.07 6 1 2 3 4 5 6 NA
## 2 L Thick A1.5 W4 2 0 0.07 6 1 2 3 4 5 6 NA
## 3 L Thick A1.5 W4 3 0 0.07 6 1 2 3 4 5 6 NA
## 4 L Thick A1.5 D4 1 0 0.07 6 1 2 3 4 5 6 NA
## 5 L Thick A1.5 D4 2 0 0.07 6 1 2 3 4 5 6 NA
## 6 L Thick A1.5 D4 3 0 0.07 6 1 2 3 4 5 6 NA
Explore switches at different levels of the tree.
completeTreeMatrix[58:62,]
## Opening Solder Mask PadType Panel skips Pred Where 1 2 3 4 5 6 7
## 58 M Thick A1.5 L9 1 0 0.60 7 1 2 3 4 5 7 NA
## 59 M Thick A1.5 L9 2 0 0.60 7 1 2 3 4 5 7 NA
## 60 M Thick A1.5 L9 3 1 0.60 7 1 2 3 4 5 7 NA
## 61 S Thick A1.5 W4 1 1 3.08 31 1 27 28 29 31 NA NA
## 62 S Thick A1.5 W4 2 0 3.08 31 1 27 28 29 31 NA NA
At row 61, Opening switched from “M” to “S” and pad type switched from “L9” to “W4”. Both changes resulted in a jump of predicted values from 0.60 to 3.08.
The path on the tree changed from node 2 to node 27 at the second level.
completeTreeMatrix[88:92,]
## Opening Solder Mask PadType Panel skips Pred Where 1 2 3 4 5 6 7
## 88 S Thick A1.5 L9 1 1 1.05 30 1 27 28 29 30 NA NA
## 89 S Thick A1.5 L9 2 1 1.05 30 1 27 28 29 30 NA NA
## 90 S Thick A1.5 L9 3 1 1.05 30 1 27 28 29 30 NA NA
## 91 L Thin A1.5 W4 1 1 0.73 10 1 2 3 9 10 NA NA
## 92 L Thin A1.5 W4 2 1 1.45 12 1 2 3 9 11 12 NA
At row 91 the path switched back from node 27 to node 2 at level 2, as a result of changes in opening and pad type.
completeTreeMatrix[448:462,]
## Opening Solder Mask PadType Panel skips Pred Where 1 2 3 4 5 6 7
## 448 S Thick A3 L9 1 0 1.05 30 1 27 28 29 30 NA NA
## 449 S Thick A3 L9 2 4 1.05 30 1 27 28 29 30 NA NA
## 450 S Thick A3 L9 3 4 1.05 30 1 27 28 29 30 NA NA
## 451 L Thin A3 W4 1 0 0.73 10 1 2 3 9 10 NA NA
## 452 L Thin A3 W4 2 2 1.45 12 1 2 3 9 11 12 NA
## 453 L Thin A3 W4 3 1 1.45 12 1 2 3 9 11 12 NA
## 454 L Thin A3 D4 1 0 0.73 10 1 2 3 9 10 NA NA
## 455 L Thin A3 D4 2 7 4.44 15 1 2 3 9 11 13 15
## 456 L Thin A3 D4 3 8 4.44 15 1 2 3 9 11 13 15
## 457 L Thin A3 L4 1 2 0.73 10 1 2 3 9 10 NA NA
## 458 L Thin A3 L4 2 2 4.44 15 1 2 3 9 11 13 15
## 459 L Thin A3 L4 3 3 4.44 15 1 2 3 9 11 13 15
## 460 L Thin A3 D6 1 0 0.73 10 1 2 3 9 10 NA NA
## 461 L Thin A3 D6 2 5 1.45 12 1 2 3 9 11 12 NA
## 462 L Thin A3 D6 3 3 1.45 12 1 2 3 9 11 12 NA
At row 451 there is a big change at level 2, caused by simultaneous changes in Opening, Solder, and PadType.
completeTreeMatrix[506:512,]
## Opening Solder Mask PadType Panel skips Pred Where 1 2 3 4 5 6 7
## 506 M Thin A6 W9 2 0 6.23 21 1 2 16 17 21 NA NA
## 507 M Thin A6 W9 3 1 6.23 21 1 2 16 17 21 NA NA
## 508 M Thin A6 L9 1 3 6.23 21 1 2 16 17 21 NA NA
## 509 M Thin A6 L9 2 7 6.23 21 1 2 16 17 21 NA NA
## 510 M Thin A6 L9 3 15 6.23 21 1 2 16 17 21 NA NA
## 511 S Thin A6 W4 1 16 27.00 46 1 27 35 41 45 46 NA
## 512 S Thin A6 W4 2 34 27.00 46 1 27 35 41 45 46 NA
At row 511 there is a big change caused by the switch in Opening.
It is possible to select the pruning level based on the intensity of such switches.
apply(completeTreeMatrix[,10:15],2,
function(z) sum(diff(na.omit(z))!=0)/length(na.omit(z)))
## 2 3 4 5 6 7
## 0.02111111 0.02333333 0.05888889 0.25444444 0.13435701 0.02380952
The switching frequency jumps at level 5, indicating that splits below level 4 mostly track noise.
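The switch-frequency statistic computed above can be illustrated on a toy path vector (a sketch with hypothetical node numbers, not taken from the original data):

```r
# Toy illustration of the switch-frequency statistic used above: for a
# path column z, diff(na.omit(z)) != 0 marks rows where the node changes
# from one observation to the next.
z <- c(2, 2, 2, 27, 27, 2, 2, NA, NA)             # hypothetical level-2 path
sum(diff(na.omit(z)) != 0) / length(na.omit(z))   # 2 switches among 7 rows
```

Here the vector switches twice (2 to 27 and back), giving a frequency of 2/7.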
Plot switches at each level.
plot(completeTreeMatrix[,10],type="l",main="Level 2",ylab="Node #")
plot(completeTreeMatrix[,11],type="l",main="Level 3",ylab="Node #")
plot(completeTreeMatrix[,12],type="l",main="Level 4",ylab="Node #")
plot(completeTreeMatrix[,13],type="l",main="Level 5",ylab="Node #")
plot(completeTreeMatrix[,14],type="l",main="Level 6",ylab="Node #")
plot(completeTreeMatrix[,15],type="l",main="Level 7",ylab="Node #")
Note that at level 6 there is a big switch at row 511 from node 21 to node 46, changing the predicted count from 6.23 to 27. But this level is still not necessary, because the same event already shows closer to the root: at level 2 the path switches from node 2 to node 27.
Prune the tree to 4 levels.
This corresponds to a cp value of 0.0287271.
printcp(sfit)
##
## Rates regression tree:
## rpart(formula = skips ~ Opening + Solder + Mask + PadType + Panel,
## data = solder, method = "poisson", control = rpart.control(cp = 0.004,
## maxcompete = 2))
##
## Variables actually used in tree construction:
## [1] Mask Opening PadType Panel Solder
##
## Root node error: 8788.2/900 = 9.7647
##
## n= 900
##
## CP nsplit rel error xerror xstd
## 1 0.3037885 0 1.00000 1.00270 0.0521095
## 2 0.1540759 1 0.69621 0.69973 0.0326983
## 3 0.1284762 2 0.54214 0.54514 0.0252752
## 4 0.0428654 3 0.41366 0.41711 0.0194107
## 5 0.0287271 4 0.37079 0.37616 0.0173988
## 6 0.0272254 5 0.34207 0.36314 0.0171739
## 7 0.0183899 6 0.31484 0.32327 0.0151030
## 8 0.0156026 7 0.29645 0.31862 0.0148515
## 9 0.0125093 8 0.28085 0.29398 0.0136167
## 10 0.0089906 10 0.25583 0.27232 0.0124124
## 11 0.0085255 11 0.24684 0.26565 0.0117998
## 12 0.0066697 13 0.22979 0.25594 0.0114715
## 13 0.0063814 14 0.22312 0.25094 0.0113376
## 14 0.0060658 15 0.21674 0.24973 0.0112412
## 15 0.0059467 16 0.21067 0.24438 0.0109590
## 16 0.0058737 17 0.20473 0.24440 0.0109517
## 17 0.0055449 18 0.19885 0.23646 0.0107538
## 18 0.0050760 19 0.19331 0.23148 0.0105358
## 19 0.0045152 20 0.18823 0.21829 0.0099479
## 20 0.0044702 21 0.18372 0.21508 0.0098090
## 21 0.0044302 22 0.17925 0.21459 0.0099189
## 22 0.0040000 23 0.17482 0.20869 0.0096200
prunedFit4<-prune(sfit,cp=0.0287271)
prp(prunedFit4,extra=101, # display the number of observations that fall in the node
branch=.5, # change angle of branch lines
shadow.col="gray", # shadows under the leaves
branch.lty=3, # draw branches using dotted lines
split.cex=1.2, # make the split text larger than the node text
split.prefix="is ", # put "is " before split text
split.suffix="?", # put "?" after split text
split.box.col="lightgray", # lightgray split boxes (default is white)
split.border.col="darkgray", # darkgray border on split boxes
split.round=.5,
nn=T) # display the node numbers, default is FALSE
plot(as.Node(as.party(prunedFit4)))
plot(as.Node(as.party(prunedFit4)),output = 'visNetwork')
Recall the example with these data in Linear and Nonlinear Model.
There we showed that Poisson regression does not fit these data well: there is significant overdispersion.
Instead, we used negative binomial regression.
Fit the negative binomial regression and compare the residuals of both models.
modnb <- glm.nb(skips ~ .,solder)
summary(modnb)
##
## Call:
## glm.nb(formula = skips ~ ., data = solder, init.theta = 4.52811339,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7047 -1.0109 -0.3921 0.4480 2.8869
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.29859 0.13202 -9.837 < 2e-16 ***
## OpeningM 0.50353 0.07932 6.348 2.18e-10 ***
## OpeningS 1.91104 0.07111 26.876 < 2e-16 ***
## SolderThin 0.93513 0.05323 17.567 < 2e-16 ***
## MaskA3 0.58383 0.09592 6.087 1.15e-09 ***
## MaskA6 2.26096 0.10101 22.384 < 2e-16 ***
## MaskB3 1.20651 0.09572 12.605 < 2e-16 ***
## MaskB6 1.98172 0.09158 21.638 < 2e-16 ***
## PadTypeD6 -0.46189 0.11145 -4.144 3.41e-05 ***
## PadTypeD7 -0.03182 0.10584 -0.301 0.763655
## PadTypeL4 0.38119 0.10177 3.745 0.000180 ***
## PadTypeL6 -0.57860 0.11327 -5.108 3.25e-07 ***
## PadTypeL7 -0.36569 0.11006 -3.323 0.000891 ***
## PadTypeL8 -0.15882 0.10734 -1.480 0.138953
## PadTypeL9 -0.56554 0.11306 -5.002 5.67e-07 ***
## PadTypeW4 -0.19851 0.10783 -1.841 0.065630 .
## PadTypeW9 -1.56332 0.13538 -11.547 < 2e-16 ***
## Panel2 0.29574 0.06322 4.678 2.90e-06 ***
## Panel3 0.33380 0.06298 5.300 1.16e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(4.5281) family taken to be 1)
##
## Null deviance: 4097.6 on 899 degrees of freedom
## Residual deviance: 1012.1 on 881 degrees of freedom
## AIC: 3679.5
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 4.528
## Std. Err.: 0.518
##
## 2 x log-likelihood: -3639.514
matplot(1:length(modnb$residuals),cbind(modnb$residuals,resid(sfit)),type="p",pch=c(1,19),
xlab="Index",ylab="Residuals",main="Comparison of Tree Regression and NB Regression")
legend("bottomleft",legend=c("NB","Tree"),col=c("black","red"),lty=1,lwd=3)
What is the sign of overdispersion in the summary of the model?
Calculate residual mean square error of the negative binomial model.
nbRmse<-rmse(modnb$residuals)
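The helper `rmse()` was defined earlier in the document; its definition is not shown in this section. A minimal sketch consistent with how it is used here (an assumption, not the document's verified definition):

```r
# Assumed definition of the rmse() helper (defined earlier in the
# document, not shown here): root mean squared value of the residuals.
rmse <- function(z) sqrt(mean(z^2))
rmse(c(3, -4))   # sqrt((9 + 16) / 2) = sqrt(12.5)
```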
Calculate probability of zero count by the negative binomial model for row 6 of data frame solder.
mu<-predict(modnb,solder[6,],type="response")
pred0NB<-dnbinom(0,mu=mu,size=modnb$theta)
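The quantity `pred0Tree` used in the comparison below was computed earlier in the document. A hedged sketch of how it would be obtained from the Poisson tree (an assumption; `lambda` is a hypothetical predicted rate, not a value from the document):

```r
# Sketch (assumption): with a Poisson tree, the predicted rate lambda
# for an observation gives P(count = 0) = dpois(0, lambda) = exp(-lambda).
lambda <- 1.03                     # hypothetical tree prediction for row 6
c(dpois(0, lambda), exp(-lambda))  # the two expressions agree
```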
Compare the RMSE and the probabilities of zero count for row 6 of the data given by the two models.
c(Tree_RMSE=treeRmse,NB_RMSE=nbRmse)
## Tree_RMSE NB_RMSE
## 5.815413 1.068110
c(tree_0Probability=pred0Tree,NB_0Probability=pred0NB)
## tree_0Probability NB_0Probability
## 0.3559772 0.6935889
Predict returns of the exchange-traded fund SPY, which tracks the S&P 500, from the returns of a group of stocks of companies in the index.
Select year 2014.
dataPath<-"Your path to the folder with the data"
SPYPortf<-read.csv(paste(dataPath,"spyPortfolio.csv",sep="/"))
head(SPYPortf,3)
## SPLS.A MTB.A UNM.A VLO.A AMZN.A ADBE.A CSX.A PG.A CMA.A
## 1 13.35805 106.3561 31.95808 44.67762 397.97 59.29 26.06945 71.50668 44.10851
## 2 13.52942 106.4948 31.96739 44.21176 396.44 59.16 26.23555 71.42678 44.32504
## 3 13.13528 106.1620 31.92084 44.64179 393.63 58.12 26.04176 71.59547 44.24972
## PEP.A DNB.A MDLZ.A SYY.A VIAB.A MDT.A T.A CHK.A
## 1 74.31774 113.9543 32.85969 32.90575 79.78655 53.40803 28.78814 24.55435
## 2 74.44447 115.0964 32.80304 33.02484 79.46537 54.43439 28.66459 24.36987
## 3 74.48068 113.7090 32.59536 32.97904 78.66707 55.25548 28.79637 24.16694
## MRO.A TMK.A MU.A AABA.A MON.A AKAM.A HSY.A BAC.A WDC.A
## 1 32.30710 51.62000 21.66 39.59 108.2317 46.53 88.48643 15.44962 75.43743
## 2 31.94617 51.72000 20.97 40.12 108.2690 46.45 88.37573 15.74710 75.98487
## 3 31.86289 51.53333 20.67 39.93 107.6455 46.11 88.12668 15.98700 75.62904
## EL.A SPY.A
## 1 70.42357 170.5143
## 2 70.36625 170.4863
## 3 70.77708 169.9923
Create daily log returns of all stocks and of SPY.
The daily log returns of the stock prices are lagged one day relative to the daily log returns of SPY, so that stock returns predict the next day's SPY return.
SPYPortf<-log(SPYPortf)
SPYPortf<-apply(SPYPortf,2,diff)
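The two lines above can be checked on a toy price series: `diff(log(p))` is exactly the vector of one-day log returns.

```r
# Toy check of the transformation above: diff(log(p)) gives the daily
# log returns log(p[t] / p[t-1]).
p <- c(100, 101, 99)
r <- diff(log(p))
isTRUE(all.equal(r, log(p[-1] / p[-length(p)])))   # TRUE
```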
Calculate meta-features as PCA factors.
SPYPortfPCA<-princomp(SPYPortf[,-28])
SPYPortf.factors<-SPYPortfPCA$scores[,1:3]
SPYPortfFactors<-as.data.frame(cbind(SPYPortf.factors,SPY=SPYPortf[,28]))
head(SPYPortfFactors,3)
## Comp.1 Comp.2 Comp.3 SPY
## 1 -0.006240906 0.013539045 0.017551027 -0.0001639528
## 2 -0.025357058 -0.008313471 -0.023439077 -0.0029021754
## 3 0.066077676 0.012254176 0.009869272 0.0061230653
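The meta-feature extraction above can be mirrored on simulated data (a self-contained sketch, not the original return matrix):

```r
# Toy sketch of the step above: extract the first three principal
# components of a simulated return matrix as meta-features.
set.seed(1)
R <- matrix(rnorm(250 * 27, sd = 0.01), 250, 27)   # 27 simulated return series
pcaR <- princomp(R)
metaFeatures <- pcaR$scores[, 1:3]                 # first three PCA factors
dim(metaFeatures)                                  # 250 rows, 3 columns
```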
Grow regression tree.
tsfit <- rpart(SPY ~.,data = SPYPortfFactors, method ='anova',
control = rpart.control(cp = 0.001))
printcp(tsfit)
##
## Regression tree:
## rpart(formula = SPY ~ ., data = SPYPortfFactors, method = "anova",
## control = rpart.control(cp = 0.001))
##
## Variables actually used in tree construction:
## [1] Comp.1 Comp.2 Comp.3
##
## Root node error: 0.012407/250 = 4.9629e-05
##
## n= 250
##
## CP nsplit rel error xerror xstd
## 1 0.5154147 0 1.00000 1.00585 0.119479
## 2 0.1411214 1 0.48459 0.50674 0.060971
## 3 0.1041148 2 0.34346 0.42592 0.057085
## 4 0.0482427 3 0.23935 0.26815 0.033231
## 5 0.0164834 4 0.19111 0.22097 0.021100
## 6 0.0162673 5 0.17462 0.21255 0.020640
## 7 0.0068456 6 0.15836 0.18700 0.018610
## 8 0.0037203 7 0.15151 0.19239 0.019007
## 9 0.0033124 8 0.14779 0.19240 0.019065
## 10 0.0025532 9 0.14448 0.19534 0.019077
## 11 0.0022440 10 0.14192 0.19231 0.019065
## 12 0.0021753 11 0.13968 0.19647 0.019319
## 13 0.0017223 12 0.13750 0.19724 0.019229
## 14 0.0014699 13 0.13578 0.19256 0.019074
## 15 0.0011689 14 0.13431 0.19339 0.019336
## 16 0.0010041 15 0.13314 0.19107 0.019345
## 17 0.0010000 16 0.13214 0.19104 0.019316
prp(tsfit,extra=101, # display the number of observations that fall in the node
branch=.5, # change angle of branch lines
shadow.col="gray", # shadows under the leaves
branch.lty=3, # draw branches using dotted lines
split.cex=1.2, # make the split text larger than the node text
split.prefix="is ", # put "is " before split text
split.suffix="?", # put "?" after split text
split.box.col="lightgray", # lightgray split boxes (default is white)
split.border.col="darkgray", # darkgray border on split boxes
split.round=.5,
nn=TRUE) # display the node numbers, default is FALSE
Prune the tree.
(best.CP = tsfit$cptable[which.min(tsfit$cptable[,"xerror"]),"CP"])
## [1] 0.006845631
In order to make the predictions less trivial, prune a little less than suggested by best.CP: use cp = 0.0037203.
prunedTsFit<-prune(tsfit,cp=0.0037203)
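Choosing the cp with minimum xerror, as done above, is one convention; another common one is the 1-SE rule. A sketch on a hypothetical cptable (the numbers are made up; columns are as printed by printcp):

```r
# 1-SE rule sketch: take the largest cp whose xerror is within one
# standard error of the minimum xerror.
cpt <- cbind(CP     = c(0.5, 0.1, 0.05, 0.01),
             xerror = c(1.00, 0.50, 0.42, 0.45),
             xstd   = c(0.12, 0.06, 0.05, 0.05))
minRow <- which.min(cpt[, "xerror"])                  # row with smallest xerror
thresh <- cpt[minRow, "xerror"] + cpt[minRow, "xstd"] # one-SE threshold
oneSE.CP <- cpt[which(cpt[, "xerror"] <= thresh)[1], "CP"]
oneSE.CP   # 0.05: the simplest tree within one SE of the minimum
```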
prp(prunedTsFit,extra=101, # display the number of observations that fall in the node
branch=.5, # change angle of branch lines
shadow.col="gray", # shadows under the leaves
branch.lty=3, # draw branches using dotted lines
split.cex=1.2, # make the split text larger than the node text
split.prefix="is ", # put "is " before split text
split.suffix="?", # put "?" after split text
split.box.col="lightgray", # lightgray split boxes (default is white)
split.border.col="darkgray", # darkgray border on split boxes
split.round=.5,
nn=TRUE) # display the node numbers, default is FALSE
Interpret the tree.
Create matrix of paths.
(tsSPY<-rpartPaths(prunedTsFit))
## 1 2 3 4 5
## [1,] 1 2 3 NA NA
## [2,] 1 2 4 5 NA
## [3,] 1 2 4 6 NA
## [4,] 1 7 8 9 NA
## [5,] 1 7 8 10 NA
## [6,] 1 7 11 12 13
## [7,] 1 7 11 12 14
## [8,] 1 7 11 15 NA
plot(as.Node(as.party(prunedTsFit)))
Create vector of all terminal nodes.
(lvsSPY<-apply(tsSPY,1,function(z) tail(na.omit(z),1)))
## [1] 3 5 6 9 10 13 14 15
Create table of numbers of observations in terminal nodes
table(prunedTsFit$where)
##
## 3 5 6 9 10 13 14 15
## 17 18 34 67 55 7 43 9
Create vector of predictions by pruned tree.
PredSPY<-round(predict(prunedTsFit,newdata=SPYPortfFactors),6)
Finally, collect in one matrix the data, predictions, terminal nodes, and entire paths for each observation.
completeMatrixSPY<-cbind(round(SPYPortfFactors,4),PredSPY,Where=prunedTsFit$where,
tsSPY[match(prunedTsFit$where,lvsSPY),])
head(completeMatrixSPY,7)
## Comp.1 Comp.2 Comp.3 SPY PredSPY Where 1 2 3 4 5
## 1 -0.0062 0.0135 0.0176 -0.0002 0.000370 9 1 7 8 9 NA
## 2 -0.0254 -0.0083 -0.0234 -0.0029 -0.003700 6 1 2 4 6 NA
## 3 0.0661 0.0123 0.0099 0.0061 0.007353 14 1 7 11 12 14
## 4 0.0376 0.0217 -0.0376 0.0002 0.007353 14 1 7 11 12 14
## 5 -0.0090 -0.0038 -0.0248 0.0007 0.000370 9 1 7 8 9 NA
## 6 0.0144 -0.0041 -0.0161 0.0027 0.002972 10 1 7 8 10 NA
## 7 -0.0906 -0.0104 -0.0199 -0.0134 -0.015175 3 1 2 3 NA NA
Analyze the pattern of the first 7 days in the period.
All leaves are ordered by the size of predicted return from -0.015 in node 3 to almost 0 in node 9, to 0.016 in node 15.
Rows 3, 4, 5, 6 ended in nodes with positive predicted returns.
Row 7 ended in negative predicted return.
Was it possible to see it coming?
First, note that the big divide is between nodes 2 and 7: on the side of node 2 the predicted return is always negative, while on the side of node 7 it is zero or positive.
In rows 3-6 the trajectories go through node 7; in row 7 there is a switch to node 2.
While passing through node 7, the trajectories in rows 3 and 4 went through node 11, which guarantees a positive return. But on day 5 the trajectory switched from node 11 to node 8, which leads either to a slightly positive node (10) or to a neutral node (9).
On days 5 and 6 the trajectories went through node 8 to non-negative terminal nodes (9, 10), but the process was only one switch away from turning negative, which happened on day 7.
Check column SPY in the table.
Actual returns had the same signs as predictions.
In general the tree regression model predicted actual returns reasonably well.
plot(completeMatrixSPY[,"PredSPY"],type="l",col="blue",lwd=3,
ylim=range(completeMatrixSPY[,"SPY"]))
lines(completeMatrixSPY[,"SPY"],col="orange")
legend("top",legend=c("Predicted","Actual"),col=c("blue","orange"),lwd=c(2,1))
plot(completeMatrixSPY[,"PredSPY"],completeMatrixSPY[,"SPY"])
Consider the example analysed in the workshop Linear Regression with Large Number of Predictors.
Simulate data
N = 500
set.seed(0)
Epsilon<-rnorm(N,0,1)
X<-rnorm(N*N,0,2)
dim(X)<-c(N,N)
colnames(X)<-paste0("X",1:N)
slopesSet<-runif(N,1,3)
Y<-sapply(2:N,function(z) 1+X[,1:z]%*%slopesSet[1:z]+Epsilon)
head(X[,1:5])
## X1 X2 X3 X4 X5
## [1,] -1.2527365 -0.5737031 0.1575234 0.88869164 -1.0643804
## [2,] 0.9626707 3.6822138 -0.1028412 0.02385876 -0.7399334
## [3,] 3.3905422 -0.3135286 -0.5215483 -0.01856009 2.3417986
## [4,] -3.5224526 -2.7796053 3.1325364 -0.60475511 -0.9291356
## [5,] 0.3960260 -2.9462080 -0.7440302 0.98471004 -0.6042920
## [6,] 0.7946982 -0.1390379 3.4838889 -1.20543924 2.7827418
Fit linear regression model with 440 regressors.
m = 440
completeModelDataFrame <- data.frame(Y=Y[,m-1],X[,1:m])
m440 <- lm(Y~.,data=completeModelDataFrame)
We reduced the number of regressors in completeModelDataFrame in order to ensure that on each iteration of 10-fold cross validation the number of regressors is less than the number of observations.
The residual standard error of the model is
summary(m440)$sigma
## [1] 0.9167553
Apply regression tree to these data.
treeFit <- rpart(Y~., data=completeModelDataFrame)
printcp(treeFit)
##
## Regression tree:
## rpart(formula = Y ~ ., data = completeModelDataFrame)
##
## Variables actually used in tree construction:
## [1] X106 X113 X131 X157 X166 X183 X2 X204 X221 X232 X248 X275 X285 X31 X325
## [16] X326 X335 X350 X354 X356 X362 X375 X382 X418 X47 X66 X68 X71 X78 X86
##
## Root node error: 3764475/500 = 7528.9
##
## n= 500
##
## CP nsplit rel error xerror xstd
## 1 0.040633 0 1.00000 1.0027 0.059277
## 2 0.039101 1 0.95937 1.1287 0.067913
## 3 0.034782 2 0.92027 1.1412 0.068250
## 4 0.030455 3 0.88548 1.2193 0.073437
## 5 0.028779 4 0.85503 1.3002 0.076320
## 6 0.027717 5 0.82625 1.3399 0.077881
## 7 0.027244 6 0.79853 1.3550 0.078162
## 8 0.026064 8 0.74404 1.3717 0.078681
## 9 0.024455 10 0.69192 1.3900 0.079635
## 10 0.023939 11 0.66746 1.3994 0.081147
## 11 0.021970 13 0.61959 1.4132 0.082570
## 12 0.020786 14 0.59762 1.4611 0.086584
## 13 0.020574 15 0.57683 1.4632 0.086215
## 14 0.019686 16 0.55626 1.4712 0.086204
## 15 0.019472 17 0.53657 1.4673 0.086224
## 16 0.018328 18 0.51710 1.5050 0.089524
## 17 0.017346 19 0.49877 1.5116 0.089904
## 18 0.016112 20 0.48142 1.5396 0.091424
## 19 0.016005 21 0.46531 1.5422 0.091847
## 20 0.014976 22 0.44931 1.5513 0.092380
## 21 0.014302 23 0.43433 1.5635 0.093042
## 22 0.013641 24 0.42003 1.5920 0.093758
## 23 0.012147 25 0.40639 1.5811 0.094859
## 24 0.011826 26 0.39424 1.6007 0.094303
## 25 0.011815 27 0.38242 1.6008 0.094356
## 26 0.011456 28 0.37060 1.6014 0.094335
## 27 0.011242 29 0.35914 1.6032 0.094611
## 28 0.010000 30 0.34790 1.6144 0.094939
plotcp(treeFit)
The outputs of printcp and plotcp show that trees cannot handle these data at all.
Although the relative error can be reduced to 35% of the root error, xerror (the cross-validation error) increases almost monotonically.
This means that any split causes overfitting, and the best tree-based prediction is the constant (root-node) model.
The RSE of the constant prediction is
sd(completeModelDataFrame$Y)
## [1] 86.85642
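The failure of the tree can be seen in miniature on simulated data (a sketch, not the original simulation): when the target is a sum of many equally weak predictors, a single split on any one of them removes only a small share of the total variance.

```r
# Toy illustration: signal spread over many predictors makes any single
# split nearly useless for variance reduction.
set.seed(1)
n <- 500; p <- 50
X <- matrix(rnorm(n * p), n, p)
y <- as.vector(X %*% rep(1, p)) + rnorm(n)
grp <- X[, 1] > 0                       # split on the first predictor at 0
withinSS <- sum((y[grp] - mean(y[grp]))^2) +
            sum((y[!grp] - mean(y[!grp]))^2)
1 - withinSS / sum((y - mean(y))^2)     # tiny fraction of variance explained
```

With 50 predictors of equal strength, a single split explains only on the order of 1-2% of the variance, which is why cross-validation prefers no splits at all.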
Use package caret to run cross validation.
ctrl <- trainControl(method = "cv", number = 10)
lmTrain <- train(Y~.,data=completeModelDataFrame,
method = 'lm', trControl = ctrl)
lmTrain$results
## intercept RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 TRUE 6.4678 0.9939519 5.234991 1.898774 0.004034775 1.503631
The linear method displayed much better predictive quality than trees in this example.
Trees could not deal with this model with so many predictors.
Claudia Perlich et al. presented a large-scale experimental comparison of trees and linear models in the paper Tree Induction vs. Logistic Regression: A Learning-Curve Analysis.
The authors examined several dozen large two-class data sets, ranging from roughly one thousand to two million examples.
The results of the study show the following.